Bilingual-LSA Based LM Adaptation for Spoken Language Translation

نویسندگان

  • Yik-Cheung Tam
  • Ian R. Lane
  • Tanja Schultz
چکیده

We propose a novel approach to crosslingual language model (LM) adaptation based on bilingual Latent Semantic Analysis (bLSA). A bLSA model is introduced which enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework crosslingual LM adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying the inferred distribution to the target language N-gram LM via marginal adaptation. The proposed framework also enables rapid bootstrapping of LSA models for new languages based on a source LSA model from another language. On Chinese to English speech and text translation the proposed bLSA framework successfully reduced word perplexity of the English LM by over 27% for a unigram LM and up to 13.6% for a 4-gram LM. Furthermore, the proposed approach consistently improved machine translation quality on both speech and text based adaptation.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Bilingual LSA-based translation lexicon adaptation for spoken language translation

We present a bilingual LSA (bLSA) framework for translation lexicon adaptation. The idea is to apply marginal adaptation on a translation lexicon so that the lexicon marginals match to indomain marginals. In the framework of speech translation, the bLSA method transfers topic distributions from the source to the target side, such that the translation lexicon can be adapted before translation ba...

متن کامل

Joint and Coupled Bilingual Topic Model Based Sentence Representations for Language Model Adaptation

This paper is concerned with data selection for adapting language model (LM) in statistical machine translation (SMT), and aims to find the LM training sentences that are topic similar to the translation task. Although the traditional approaches have gained significant performance, they ignore the topic information and the distribution information of words when selecting similar training senten...

متن کامل

Neural Network Based Bilingual Language Model Growing for Statistical Machine Translation

Since larger n-gram Language Model (LM) usually performs better in Statistical Machine Translation (SMT), how to construct efficient large LM is an important topic in SMT. However, most of the existing LM growing methods need an extra monolingual corpus, where additional LM adaption technology is necessary. In this paper, we propose a novel neural network based bilingual LM growing method, only...

متن کامل

Efficient language model adaptation for automatic speech recognition of spoken translations

Direct integration of translation model (TM) probabilities into a language model (LM) with the purpose of improving automatic speech recognition (ASR) of spoken translations typically requires a number of complex operations for each sentence. Many if not all of the LM probabilities need to be updated, the model needs to be renormalized and the ASR system needs to load a new, updated LM for each...

متن کامل

Anticipatory Translation Model Adaptation for Bilingual Conversations

Conversational spoken language translation (CSLT) systems facilitate bilingual conversations in which the two participants speak different languages. Bilingual conversations provide additional contextual information that can be used to improve the underlying machine translation system. In this paper, we describe a novel translation model adaptation method that anticipates a participant’s respon...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007